set.seed(12345)
source("ingest_data.R")
MODEL_CACHE <- "added-M_added_removed_lines"
Models are stored in the following directory, which must exist prior to knitting this document:
cat(normalizePath(paste(getwd(), dirname(cachefile(MODEL_CACHE)), sep="/"), mustWork = TRUE))
## /home/app/.cache
The cache directory can be controlled via the cache parameter of the Rmd document; it can be useful to experiment with this parameter if you knit the document manually in RStudio.
Our next model introduces two numerical predictors, added and removed lines, in addition to the existing categorical team and repository predictors. The intercept and the removed-lines slope are allowed to vary between teams, and between teams within repositories.
d <- data |> select(y=INTROD,
A=A,
C=C,
D=D,
R=R,
team=committerteam,
repo=repo)
formula <- bf(y ~ 1 + A + R + (1 + R | team) + (1 + R | team:repo),
zi ~ 1 + A + R + (1 + R | team) + (1 + R | team:repo))
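Before assigning priors, it may help to recall what the zero-inflated negative binomial family means: with probability zi the outcome is a structural zero, and otherwise it is drawn from a negative binomial. A minimal base-R sketch of the probability mass function (the function name dzinb is ours, for illustration only):

```r
# Zero-inflated negative binomial pmf as a two-component mixture:
# a point mass at zero (weight zi) plus a negative binomial with
# mean mu and shape phi (weight 1 - zi).
dzinb <- function(y, mu, phi, zi) {
  ifelse(y == 0,
         zi + (1 - zi) * dnbinom(0, mu = mu, size = phi),
         (1 - zi) * dnbinom(y, mu = mu, size = phi))
}

# sanity check: the probabilities over 0..1000 sum to (almost) 1
sum(dzinb(0:1000, mu = 2, phi = 1, zi = 0.9))  # ~1
```

The zi part of the formula models the mixture weight on the logit scale, which is why it gets its own intercept, slopes, and group-level terms.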
get_prior(data=d,
family=zero_inflated_negbinomial,
formula=formula)
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) b A
## (flat) b R
## lkj(1) cor
## lkj(1) cor team
## lkj(1) cor team:repo
## student_t(3, -2.3, 2.5) Intercept
## student_t(3, 0, 2.5) sd 0
## student_t(3, 0, 2.5) sd team 0
## student_t(3, 0, 2.5) sd Intercept team 0
## student_t(3, 0, 2.5) sd R team 0
## student_t(3, 0, 2.5) sd team:repo 0
## student_t(3, 0, 2.5) sd Intercept team:repo 0
## student_t(3, 0, 2.5) sd R team:repo 0
## gamma(0.01, 0.01) shape 0
## (flat) b zi
## (flat) b A zi
## (flat) b R zi
## logistic(0, 1) Intercept zi
## student_t(3, 0, 2.5) sd zi 0
## student_t(3, 0, 2.5) sd team zi 0
## student_t(3, 0, 2.5) sd Intercept team zi 0
## student_t(3, 0, 2.5) sd R team zi 0
## student_t(3, 0, 2.5) sd team:repo zi 0
## student_t(3, 0, 2.5) sd Intercept team:repo zi 0
## student_t(3, 0, 2.5) sd R team:repo zi 0
## source
## default
## (vectorized)
## (vectorized)
## default
## (vectorized)
## (vectorized)
## default
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## default
## default
## (vectorized)
## (vectorized)
## default
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
priors <- c(prior(normal(0, 0.5), class = Intercept),
prior(normal(0, 0.25), class = b),
prior(weibull(2, 0.25), class = sd),
prior(lkj(2), class = cor),
prior(normal(0, 0.5), class = Intercept, dpar=zi),
prior(normal(0, 0.25), class = b, dpar=zi),
prior(weibull(2, 0.25), class = sd, dpar=zi),
prior(gamma(0.5, 0.1), class = shape))
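As a quick sanity check on the weibull(2, 0.25) prior chosen for all standard-deviation parameters, its quantiles can be inspected directly (illustrative, base R only):

```r
# Quantiles of the weibull(2, 0.25) prior used for the sd parameters.
# The median is about 0.21 and the 95th percentile about 0.43, so the
# prior expects modest group-level variation while keeping sd > 0.
round(qweibull(c(0.05, 0.5, 0.95), shape = 2, scale = 0.25), 3)
# 0.057 0.208 0.433
```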
(v <- validate_prior(prior=priors,
formula=formula,
data=d,
family=zero_inflated_negbinomial)
)
## prior class coef group resp dpar nlpar lb ub
## normal(0, 0.25) b
## normal(0, 0.25) b A
## normal(0, 0.25) b R
## normal(0, 0.25) b zi
## normal(0, 0.25) b A zi
## normal(0, 0.25) b R zi
## normal(0, 0.5) Intercept
## normal(0, 0.5) Intercept zi
## lkj_corr_cholesky(2) L
## lkj_corr_cholesky(2) L team
## lkj_corr_cholesky(2) L team:repo
## weibull(2, 0.25) sd 0
## weibull(2, 0.25) sd zi 0
## weibull(2, 0.25) sd team 0
## weibull(2, 0.25) sd Intercept team 0
## weibull(2, 0.25) sd R team 0
## weibull(2, 0.25) sd team zi 0
## weibull(2, 0.25) sd Intercept team zi 0
## weibull(2, 0.25) sd R team zi 0
## weibull(2, 0.25) sd team:repo 0
## weibull(2, 0.25) sd Intercept team:repo 0
## weibull(2, 0.25) sd R team:repo 0
## weibull(2, 0.25) sd team:repo zi 0
## weibull(2, 0.25) sd Intercept team:repo zi 0
## weibull(2, 0.25) sd R team:repo zi 0
## gamma(0.5, 0.1) shape 0
## source
## user
## (vectorized)
## (vectorized)
## user
## (vectorized)
## (vectorized)
## user
## user
## user
## (vectorized)
## (vectorized)
## user
## user
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## user
Relative to the intercept-only model, we now need additional prior information. The article discusses scaling and centering the numerical predictors, which allows us to choose a prior for the slopes in a unit-agnostic way. We settle on a \(\mathcal{N}(0, 0.25)\) distribution.
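As an illustration of what that scaling buys us (a sketch; the actual preprocessing happens in ingest_data.R), a standardized predictor has mean 0 and standard deviation 1, so a \(\mathcal{N}(0, 0.25)\) slope prior speaks in standard-deviation units regardless of whether lines are counted in tens or thousands:

```r
# Standardize a numeric predictor: subtract the mean, divide by the sd.
# One unit of the result equals one standard deviation of the original.
standardize <- function(x) (x - mean(x)) / sd(x)

added <- c(10, 50, 200, 0, 40)  # toy added-lines counts
z <- standardize(added)
round(mean(z), 10)  # 0 (centered)
round(sd(z), 10)    # 1 (unit scale)
```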
Likewise, we need a prior for the correlation between slopes and intercepts. We use the standard recommended, moderately strong \(LKJ(2)\) prior, which is mildly sceptical of extreme correlations (nearing \(\pm 1\)) between slopes and intercepts.
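For a \(2 \times 2\) correlation matrix the LKJ prior has a closed-form marginal density on the correlation, proportional to \((1 - r^2)^{\eta - 1}\), which makes the mild scepticism easy to see (a sketch, base R only):

```r
# Marginal density of the correlation r under LKJ(2) for a 2x2 matrix:
# p(r) = (3/4) * (1 - r^2), which vanishes at r = +/-1 and gently
# favours moderate correlations.
lkj2_density <- function(r) 0.75 * (1 - r^2)

r <- seq(-1, 1, length.out = 201)
sum(lkj2_density(r)) * diff(r)[1]  # integrates to ~1
lkj2_density(c(-1, 0, 1))          # 0.00 0.75 0.00
```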
M_ppc <- brm(data = d,
family = zero_inflated_negbinomial,
formula = formula,
prior = priors,
file = cachefile(paste0("ppc-",MODEL_CACHE)),
warmup = 1000,
iter = ITERATIONS,
chains = CHAINS,
cores = CORES,
sample_prior = "only",
backend="cmdstanr",
file_refit = "on_change",
threads = threading(THREADS),
save_pars = SAVE_PARS,
adapt_delta = ADAPT_DELTA)
m <- M_ppc
yrep <- posterior_predict(m)
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0)) + ggtitle("Prior predicted proportion of zeros")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The overall shape of the prior predicted proportion of zeros is similar to the intercept-only model, but this model is slightly “flatter at the top”. The model expects anything from \(40\%\) to \(100\%\) zeros in the data, with the most likely proportion around \(75\%\)--\(85\%\), slightly below the observed value of \(96\%\).
The maximum predicted values are very similar to the intercept-only model, still ranging from the low single digits up to about 150,000, or even 200,000. The higher ends are unrealistic, but also very unlikely in our model.
(sim_max <- ppc_stat(y = d$y, yrep, stat = "max") + ggtitle("Prior predicted max values")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Scaling the prior predictions to more reasonable values shows that our model is similar to the intercept-only model: it still expects somewhat smaller max values than observed, but is also somewhat more likely to predict values ranging up to 1000 (the histogram there is higher than in the corresponding figure for \(\mathcal{M}_0\)).
sim_max + scale_x_continuous(limits = c(0,1000)) + ggtitle("Prior predicted max values up to 1000")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The 99th percentile is also a good fit for our model.
ppc_stat(y = d$y, yrep, stat = "q99") + ggtitle("Prior predicted Q99 vs. observed value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Relative to the intercept-only model, the shape is the same, but there is a somewhat wider range of values (up to about 100 for Q99, and 30 for Q95).
(p <- ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + theme(legend.position="bottom") + ggtitle("Prior predicted Q95 vs Q99")
)
The same holds for the standard deviations: a slightly wider range, considering only the priors.
(p <- ppc_stat(y = d$y, yrep, stat = "sd") + ggtitle("Prior predicted stddev vs. observed value")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Zooming in shows the distribution in more detail. Relative to the intercept-only model, this model has a slightly wider standard-deviation range (more histogram mass above 10).
p + scale_x_continuous(limits=c(0,30))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The group-level predictions are more spread out (Q99 varies more than for the intercept-only model), but the observed values are all covered by the group predictions.
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + theme(legend.position = "bottom") + ggtitle("Prior predictive Q99 observation per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Similar observations apply to the repository-level variation.
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + theme(legend.position = "bottom") + ggtitle("Prior predictive Q99 observation per repository")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
M_added_removed_lines <- brm(data = d,
family = zero_inflated_negbinomial,
file = cachefile(MODEL_CACHE),
formula = formula,
prior = priors,
warmup = 1000,
iter = ITERATIONS,
chains = CHAINS,
cores = CORES,
backend="cmdstanr",
file_refit = "on_change",
threads = threading(THREADS),
save_pars = SAVE_PARS,
adapt_delta = ADAPT_DELTA)
M_added_removed_lines <- add_criterion(M_added_removed_lines, "loo")
m <- M_added_removed_lines
# extract the full list of parameter names from a complete trace plot
p <- mcmc_trace(m)
pars <- levels(p[["data"]][["parameter"]])
# then plot the traces in batches of 12 parameters per page
plots <- seq(1, to=length(pars), by=12)
lapply(plots, function(i) {
start <- i
end <- start+11
mcmc_trace(m, pars = na.omit(pars[start:end]))
})
## (34 pages of trace plots, one per batch of 12 parameters, omitted)
The “caterpillar plots” also show that the chains mixed well.
mcmc_plot(m, type="rhat")
mcmc_plot(m, type="rhat_hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mcmc_plot(m, type="neff")
mcmc_plot(m, type="neff_hist")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The MCMC chains, \(\hat{R}\) values, and \(N_{eff}\) ratios all look good.
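For reference, the split-\(\hat{R}\) statistic that these plots summarize can be sketched in a few lines of base R (illustrative only; Stan uses a rank-normalized refinement of this formula):

```r
# Split-Rhat: split each chain in half (to detect within-chain trends),
# then compare between-chain and within-chain variance. Values near 1
# indicate that the chains mixed well.
split_rhat <- function(draws) {  # draws: iterations x chains
  half <- nrow(draws) %/% 2
  d <- cbind(draws[1:half, ], draws[(half + 1):(2 * half), ])
  n <- nrow(d)
  chain_means <- colMeans(d)
  B <- n * var(chain_means)      # between-chain variance
  W <- mean(apply(d, 2, var))    # within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
draws <- matrix(rnorm(4000), ncol = 4)  # four well-mixed "chains"
split_rhat(draws)                       # very close to 1
```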
loo <- loo(m)
loo
##
## Computed from 12000 by 31007 log-likelihood matrix.
##
## Estimate SE
## elpd_loo -8696.8 167.6
## p_loo 173.7 13.3
## looic 17393.6 335.1
## ------
## MCSE of elpd_loo is NA.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 2.1]).
##
## Pareto k diagnostic values:
## Count Pct. Min. ESS
## (-Inf, 0.7] (good) 30998 100.0% 188
## (0.7, 1] (bad) 6 0.0% <NA>
## (1, Inf) (very bad) 3 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
plot(loo)
There are 9 outlier points (6 bad and 3 very bad Pareto-k values) that need to be analysed, e.g. with reloo.
Reloo is disabled by default, but can be enabled by setting the RELOO environment variable (which in turn sets the reloo parameter of this document).
reloofile <- cachefile(paste0("reloo-", MODEL_CACHE, ".rds"))
if (file.exists(reloofile)) {
(reloo <- readRDS(reloofile))
} else {
Sys.time()
(reloo <- reloo(m, loo, chains=CHAINS, cores=CORES) )
Sys.time()
saveRDS(reloo, reloofile)
}
##
## Computed from 12000 by 31007 log-likelihood matrix.
##
## Estimate SE
## elpd_loo -8698.6 167.7
## p_loo 175.6 13.9
## looic 17397.3 335.5
## ------
## MCSE of elpd_loo is 0.5.
## MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 2.1]).
##
## All Pareto k estimates are good (k < 0.7).
## See help('pareto-k-diagnostic') for details.
Note that the reloo results are cached, and included in the Docker image by default. If you want to reproduce the reloo manually, set a separate cache dir (and mount it into the Docker image), and enable the RELOO environment variable. It will take several hours, so it is best done overnight.
# plotting the recalculated values
plot(reloo)
# which points have higher pareto_k than 0.5?
influencers <- data[loo::pareto_k_ids(reloo),]
influencers |> group_by(repo,committerteam) |> tally()
## # A tibble: 0 × 3
## # Groups: repo [0]
## # ℹ 3 variables: repo <fct>, committerteam <fct>, n <int>
According to the reloo function, all Pareto-k values are below 0.7, so we should be able to trust the model inferences.
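The thresholds behind the good/bad/very bad labels can be applied to any vector of Pareto-k values. A toy illustration (the real values would come from loo::pareto_k_values):

```r
# Classify Pareto-k values using the cutoffs reported by loo:
# k <= 0.7 is good, 0.7 < k <= 1 is bad, k > 1 is very bad.
k <- c(0.12, 0.45, 0.68, 0.73, 1.2)  # hypothetical k values
table(cut(k, breaks = c(-Inf, 0.7, 1, Inf),
          labels = c("good", "bad", "very bad")))
```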
yrep <- posterior_predict(m)
ppc_stat(y = d$y, yrep, stat = function(y) mean(y == 0))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The predicted proportion of zeros is centred marginally higher than the observed value, but the distribution still encompasses the observed proportion.
(sim_max <- ppc_stat(y = d$y, yrep, stat = "max")
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Relative to the intercept-only model, our model now expects a wider range of values, up to a few thousand duplicates. But the most likely value is still close to the observed max value (150).
sim_max + scale_x_continuous(limits = c(0,1000))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
When scaling to \(0\)--\(1000\), we see that the most likely prediction of our model is actually somewhat less than the observed max value of 150. The predictions then drop off towards 750 and 1000.
ppc_stat(y = d$y, yrep, stat = "sd")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The same applies to the standard deviation: its distribution is somewhat wider than for the intercept-only model.
ppc_stat(y = d$y, yrep, stat = function(y) quantile(y, 0.99)) + ggtitle("Posterior predicted Q99")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ppc_stat_2d(d$y, yrep, stat = c("q95", "q99")) + ggtitle("Posterior predicted Q95 vs Q99")
The 95th and 99th quantiles are spot on the observed values of 1 and 5, respectively. Still, small variations are allowed.
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$team) + ggtitle("Posterior predictive max observation per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The max value per team varies between teams, but for most teams the observed value falls reasonably well within the predictions. The Red team is the exception. The maximum observation is a volatile metric for right-skewed data like this.
ppc_stat_grouped(y = d$y, yrep, stat = "max", group = d$repo) + ggtitle("Posterior predictive max observation per repo")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We see that overall, the model predicts more duplicates in the integration test repository (just like the data says).
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$team) + ggtitle("Posterior predictive 99th percentile per team")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The 99th percentile value predictions seem very well fitted. Predictions surround the observation nicely, and the uncertainty about the low-contributing UI team is also visible.
ppc_stat_grouped(y = d$y, yrep, stat = "q99", group = d$repo) + ggtitle("Posterior predictive 99th percentile per repo")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The same thing applies to the 99th percentile per repository.
In a rootogram, the predicted counts are placed on the \(x\)-axis, and the square roots of the observed and expected frequencies on the \(y\)-axis. In the suspended style, the difference between expected and observed frequencies is shown as bars protruding from the \(x\)-axis, one for each predicted count.
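To make the construction concrete, here is a suspended rootogram computed by hand on toy data (a sketch; pp_check delegates the real computation to bayesplot, and the sign convention shown, expected minus observed on the square-root scale, is one common choice):

```r
# Suspended rootogram by hand: for each count value, the bar height is
# sqrt(expected) - sqrt(observed), so a positive bar means the model
# over-predicts that count. Toy data against a Poisson(1) expectation.
obs <- c(40, 32, 20, 6, 2)            # observed frequencies of counts 0..4
n <- sum(obs)
exp_freq <- n * dpois(0:4, lambda = 1)
suspended <- sqrt(exp_freq) - sqrt(obs)
round(suspended, 2)  # first bar negative: more zeros observed than expected
```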
(rootogram <- pp_check(m, type = "rootogram", style="suspended")
)
## Using all posterior draws for ppc type 'rootogram' by default.
The full-scale rootogram does not say much; we need to restrict it to a reasonable range.
rootogram + scale_x_continuous(limits=c(0,50)) + ggtitle("Suspended rootogram, scaled to 0-50 prediction interval")
## Warning: Removed 7131 rows containing missing values or values outside the scale range
## (`geom_tile()`).
## Warning: Removed 7129 rows containing missing values or values outside the scale range
## (`geom_smooth()`).
source("predict_utils.R")
# the predict functions use all predictors, so we have to supply them, even though this model only considers added and removed lines
heatmap_by_team_and_repo(posterior_predict_by_team_and_repo(m, added=q95(data$ADD), removed=q95(data$DEL), complexity=q95(data$COMPLEX), duplicates = q95(data$DUP), summary=function(x) q99(x)), "Quantile 99%", decimals=0) + ggtitle("Quantile 99% per team and repo")
Relative to the intercept-only model, this model shows more team-level variation (each team's tendency to introduce duplicates, relative to its added and removed lines, is taken into account). For instance, Architects are now ranked low, while Brown and Orange are ranked high among the common teams. The more sporadic teams (UI, Pink, and Unknown) are also expected to introduce more duplicates overall.
Our model converges well, but is limited in its predictive capabilities. We will use it for comparisons with the full model.